Hamake: A Data Flow Approach to Data Processing in Hadoop

Authors

  • Vadim Zaliva
  • Vladimir Orlov
Abstract

Most non-trivial data processing scenarios using Hadoop typically involve launching more than one MapReduce job. Usually, such processing is data-driven, with the data funneled through a sequence of jobs. The processing model can be expressed in terms of dataflow programming, represented as a directed graph with datasets as vertices. Using fuzzy timestamps to detect which datasets need to be updated, we can calculate the sequence in which Hadoop jobs should be launched to bring all datasets up to date. Incremental data processing and parallel job execution fit well into this approach. These ideas inspired the creation of the hamake utility. We attempted to emphasize data, allowing the developer to formulate the problem as a data flow, in contrast to the more commonly used workflow approach. The hamake language uses just two dataflow operators, fold and foreach, providing a clear processing model similar to MapReduce, but at the dataset level.

I. MOTIVATION AND BACKGROUND

Hadoop [1] is a popular open-source implementation of MapReduce, a data processing model introduced by Google [2]. Hadoop is typically used to process large amounts of data through a series of relatively simple operations. Hadoop jobs are usually I/O-bound [3], [4], and executing even a trivial operation on a large dataset can take significant system resources. This makes incremental processing especially important. Our initial inspiration was the Unix make utility. While applying some of the ideas implemented by make to Hadoop, we took the opportunity to generalize the processing model in terms of dataflow programming. Hamake was developed in late 2008 to address the problem of incremental processing of large data sets in a collaborative filtering project. We have striven to create an easy-to-use utility that developers can start using right away, without complex installation or an extensive learning curve. Hamake is open source and is distributed under the Apache License v2.0. The project is hosted at Google Code at the following URL: http://code.google.com/p/hamake/.

II. PROCESSING MODEL

Hamake operates on files residing on a local or distributed file system accessible from the Hadoop job. Each file has a timestamp reflecting the date and time of its last modification. A file system directory or folder is also a file with its own timestamp. A Data Transformation Rule (DTR) defines an operation that takes files as input and produces other files as output. If file A is listed as an input of a DTR, and file B is listed as an output of the same DTR, it is said that "B depends on A." Hamake uses file timestamps for dependency up-to-date checks. A DTR's output is said to be up to date if the minimum timestamp over all outputs is greater than or equal to the maximum timestamp over all inputs. For convenience, a user can arrange groups of files and folders into a fileset, which can later be referenced as a DTR's input or output.

Hamake uses fuzzy timestamps¹, which can be compared while allowing for a slight margin of error. The "fuzziness" is controlled by a tolerance σ: timestamp a is considered to be older than timestamp b if (b − a) > σ. Setting σ = 0 gives a non-fuzzy, strict timestamp comparison.

Hamake attempts to ensure that all outputs of a DTR are up to date². To do so, it builds a dependency graph with DTRs as edges and individual files or filesets as vertices. Below, we show that this graph is guaranteed to be a Directed Acyclic Graph (DAG).
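To make the timestamp rules concrete, the following sketch shows the fuzzy comparison and the resulting up-to-date check for a single DTR. This is an illustration under our own naming, not code from hamake's source; timestamps are assumed to be millisecond values such as those returned by Hadoop's FileStatus.getModificationTime().

```java
// Illustrative sketch only -- not hamake's actual implementation.
import java.util.List;

public class DtrFreshness {

    /** True if timestamp a is older than timestamp b, allowing a margin of
     *  error sigma (all values in milliseconds). sigma = 0 degenerates to a
     *  strict comparison. */
    static boolean isOlder(long a, long b, long sigma) {
        return (b - a) > sigma;
    }

    /** A DTR's outputs are up to date when the minimum timestamp over all
     *  outputs is not (fuzzily) older than the maximum timestamp over all
     *  inputs. */
    static boolean isUpToDate(List<Long> inputTimestamps,
                              List<Long> outputTimestamps,
                              long sigma) {
        long maxInput = inputTimestamps.stream()
                .max(Long::compare).orElse(Long.MIN_VALUE);   // no inputs: trivially up to date
        long minOutput = outputTimestamps.stream()
                .min(Long::compare).orElse(Long.MAX_VALUE);   // no outputs: nothing to check
        return !isOlder(minOutput, maxInput, sigma);
    }
}
```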
After building the dependency graph, a graph reduction algorithm (shown in Figure 1) is executed. Step 1 uses Kahn's algorithm [5] for topological ordering. In step 6, when a completed DTR is removed from the dependency graph, all edges pointing to it from other DTRs are also removed. The algorithm allows for parallelism: if more than one DTR without input dependencies is found during step 1, the subsequent steps 2-6 can be executed in parallel for each discovered DTR. It should be noted that if a DTR execution has failed, hamake can and will continue to process other DTRs that do not depend, directly or indirectly, on the results of the failed DTR. This permits the user to fix problems later and re-run hamake without needing to re-process all data.

Cyclic dependencies must be avoided, because a dataflow containing such dependencies is not guaranteed to terminate. Implicit checks are performed during the reading of the dataflow definitions and the building of the dependency graph. If a cycle is detected, it is reported as an error. Thus the dependency graph is guaranteed to be acyclic.

¹ The current stable version of hamake uses exact (non-fuzzy) timestamps.
² Because hamake has no way to update them, it does not attempt to ensure that files are up to date unless they are listed as one of a DTR's outputs.
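As an illustration of this reduction loop, the sketch below is our own simplified reconstruction of the steps described above, not hamake's actual classes: dependencies are collapsed to direct DTR-to-DTR edges, DTRs with no unfinished dependencies are executed in parallel, and DTRs that depend, directly or indirectly, on a failed DTR are skipped.

```java
// Simplified reconstruction of the graph reduction loop (illustrative only).
import java.util.*;
import java.util.concurrent.*;

public class GraphReduction {

    interface Dtr {
        boolean execute();        // run the transformation; true on success
        Set<Dtr> dependencies();  // DTRs whose outputs this DTR consumes
    }

    static void run(Collection<Dtr> dtrs, int parallelism) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        Set<Dtr> pending = new HashSet<>(dtrs);
        Set<Dtr> blocked = new HashSet<>();   // failed DTRs; their dependents never become ready

        while (!pending.isEmpty()) {
            // Step 1 (Kahn-style): select DTRs with no unfinished or failed dependencies.
            List<Dtr> ready = new ArrayList<>();
            for (Dtr d : pending) {
                if (Collections.disjoint(d.dependencies(), pending)
                        && Collections.disjoint(d.dependencies(), blocked)) {
                    ready.add(d);
                }
            }
            if (ready.isEmpty()) {
                // Everything left depends (transitively) on a failed DTR; stop.
                break;
            }
            // Steps 2-6: execute all ready DTRs in parallel, then remove them from
            // the graph (on success) or mark them as blocked (on failure).
            Map<Dtr, Future<Boolean>> results = new HashMap<>();
            for (Dtr d : ready) {
                Callable<Boolean> task = d::execute;
                results.put(d, pool.submit(task));
            }
            for (Map.Entry<Dtr, Future<Boolean>> e : results.entrySet()) {
                boolean ok;
                try {
                    ok = e.getValue().get();
                } catch (ExecutionException ex) {
                    ok = false;
                }
                pending.remove(e.getKey());
                if (!ok) blocked.add(e.getKey());
            }
        }
        pool.shutdown();
    }
}
```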


Publication date: 2012